diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
index 073873e8..3edad436 100644
--- a/.github/pull_request_template.md
+++ b/.github/pull_request_template.md
@@ -10,17 +10,14 @@
 - ✅ Provide tests for your changes.
 - 📝 Use descriptive commit messages.
 - 📗 Update any related documentation and include any relevant screenshots.
-- 📗 Reflect the changes you made in the changelog.
-
 -->
 
 ## PR Checklist
+
 - [ ] Use descriptive commit messages.
 - [ ] Provide tests for your changes.
 - [ ] Update any related documentation and include any relevant screenshots.
 - [ ] Check if changes need to be made to docs (README or any guides in `/docs/`).
-- [ ] Reflect the changes you made in the changelog.
 
 ## What type of PR is this? (check all applicable)
@@ -52,8 +49,8 @@
 _Please replace this line with instructions on how to test your changes, a note on the hardware and config this has been tested on, as well as any relevant additional information._
-
 ## Added/updated tests?
+
 - [ ] Yes
 - [ ] No, and this is why: _please replace this line with details on why tests have not been included_
diff --git a/docs/add_new_benchmark_guide.md b/docs/add_new_benchmark_guide.md
index 53875e98..11ed4079 100644
--- a/docs/add_new_benchmark_guide.md
+++ b/docs/add_new_benchmark_guide.md
@@ -5,10 +5,12 @@ This guide provides comprehensive instructions for adding new benchmarks to the
 ## Overview
 
 The eval-framework supports two response types:
+
 1. **Completion Tasks** - Generate text completions (e.g., math problems, code generation)
 2. **Loglikelihood Tasks** - Multiple choice questions where the model ranks answer options
 
 For detailed information about implementing each task type, please refer to:
+
 - [Completion Task Guide](completion_task_guide.md) - Comprehensive guide for text generation tasks
 - [Loglikelihood Task Guide](loglikelihood_task_guide.md) - Detailed guide for multiple choice tasks
@@ -95,7 +97,6 @@ def post_process_generated_completion(self, completion_text: str, sample: Sample
 
 This section provides a complete reference for all configurations available when creating benchmarks.
-
 ### Response Types
 
 The response type determines how your model interacts with the task and what type of output is expected.
@@ -114,7 +115,6 @@ RESPONSE_TYPE = ResponseType.LOGLIKELIHOODS
 
 Metrics define how your task's outputs are evaluated and scored. Choose metrics that align with your response type and evaluation goals.
-
 #### Completion Metrics
 
 These metrics work with generated text outputs from COMPLETION tasks:
@@ -175,7 +175,6 @@ from eval_framework.metrics.loglikelihood.probability_mass import ProbabilityMas
 
 These metrics use another LLM to evaluate generated outputs, useful for complex or subjective tasks:
-
 ```python
 from eval_framework.metrics.llm.llm_judge_chatbot_style import LLMJudgeChatbotStyle  # Classifies whether a text generation model's response follows a chatbot-style format by evaluating characteristics like friendly introductions, verbose language, follow-up questions, and conversational fluff, returning a boolean classification with reasoning. (English and German)
@@ -221,7 +220,6 @@ from eval_framework.metrics.llm.llm_judge_world_knowledge import LLMJudgeWorldKn
 ```
-
 ## Implementation Examples and Patterns
 
 ### Practical Example: GeographyQATask
@@ -267,7 +265,6 @@ class GeographyQATask(BaseTask[str]):
         return self.rnd.sample(self.dataset[self.FEWSHOT_SPLIT], self.num_fewshot)
 ```
-
 ### Add to Task Registry
 
 Add a registration call for your new benchmark to `register_all_tasks` in `src/eval_framework/tasks/task_names.py`:
@@ -280,32 +277,70 @@ The task will now be available through `get_task("GeographyQA")`.
 
 ### Testing your benchmark
 
-All tasks automatically go through formatting tests to ensure proper prompt generation. However, if your benchmark has specific functionality that needs testing, create a dedicated test file.
+All tasks automatically go through formatting tests to ensure proper prompt generation. The formatting test lives in `tests/tests_eval_framework/tasks/test_all_formatters.py` and runs all registered tasks automatically.
+
+> [!TIP]
+> CI runs this test across all registered tasks, but during development you can target only your task, for example:
+> `uv run pytest tests/tests_eval_framework/tasks/test_all_formatters.py -k "YourTaskName"`
 
 #### Automatic Formatting Tests
-All benchmarks are automatically tested for proper prompt formatting across different chat templates. No additional setup required.
+
+All benchmarks are automatically tested for proper prompt formatting across different chat templates. If your new task needs non-default initialization arguments (for example, a specific `num_fewshot`), add an entry for your task to `SPECIAL_ARGS` in `tests/tests_eval_framework/tasks/test_all_formatters.py`.
+
+The expected formatter outputs are tracked as hashes in `tests/tests_eval_framework/tasks/task-prompts-hashes.json`.
+
+When you add a new task:
+
+1. Run the formatter hash test once for your task to generate/check hashes.
+2. If your task hash is new, it will be added to `task-prompts-hashes.json`.
+3. Commit the updated JSON file together with your task changes.
+
+Run the formatter hash test only for your newly created task (replace `YourTaskName`):
+
+```bash
+uv run pytest tests/tests_eval_framework/tasks/test_all_formatters.py -m formatter_hash -k "YourTaskName"
+```
 
 #### Custom Task Tests (Optional)
+
 If your benchmark has specific logic that needs testing, create a dedicated test file in `tests/tasks/`.
 
+### Update benchmark documentation
+
+After adding a benchmark, you also need to update the task documentation:
+
+1. Manually add the new benchmark name(s) to `docs/benchmarks_and_metrics.md` (including `*_IDK` variants if your benchmark has them).
+2. Regenerate the task docs:
+
+```bash
+uv run -m eval_framework.utils.generate_task_docs
+```
+
+This updates `docs/tasks/README.md` and creates per-task documentation files for new tasks in `docs/tasks/`.
+
 ## Benchmark Examples by Task Type
 
 Study these existing benchmarks in the codebase for more complex patterns:
 
 #### Simple Classification Tasks
+
 - **ARC** (`src/eval_framework/tasks/arc.py`): Multiple choice with loglikelihoods
 - **MMLU** (`src/eval_framework/tasks/mmlu.py`): Multi-subject classification with enum subjects
 
 #### Reasoning Tasks
+
 - **GSM8K** (`src/eval_framework/tasks/gsm8k.py`): Math reasoning with answer extraction patterns
 
 #### Code Generation
+
 - **HumanEval** (`src/eval_framework/tasks/human_eval.py`): Code completion with execution validation
 - **MBPP** (`src/eval_framework/tasks/mbpp.py`): Code generation with comprehensive test validation
 
 #### Long Context Tasks
+
 - **InfiniteBench** (`src/eval_framework/tasks/infinite_bench_tasks.py`): Long context reasoning tasks
 
 #### Custom Format Tasks
+
 - **IFEval** (`src/eval_framework/tasks/ifeval.py`): Instruction following with format validation
 - **JSON/CSV Tasks:** Custom format validation examples
diff --git a/docs/benchmarks_and_metrics.md
b/docs/benchmarks_and_metrics.md
index 01900a9e..3face99d 100644
--- a/docs/benchmarks_and_metrics.md
+++ b/docs/benchmarks_and_metrics.md
@@ -6,6 +6,55 @@ Additional task documentation can be generated with the script `utils/generate-t
 
 ## Completion
 
+| **Task** | **Capability** | **Benchmarks** | **Long Context** |
+| --- | --- | --- | --- |
+| Logical Reasoning | Math | `AIME2024`, `AIME2025`, `GSM8K`, `GSM8K_EU20_DE`, `GSM8K_EU20_FR`, `GSM8KEvalHarness`, `GSM8KReasoning`, `MATH`, `MATH500`, `MATH500Minerva`, `MATHLvl5`, `MATHMinerva`, `MATHMinervaBPB`, `MATHMinervaEvalHarness`, `TableBench` | `InfiniteBench_MathFind` |
+| Logical Reasoning | Programming | `BigCodeBench`, `BigCodeBenchHard`, `BigCodeBenchInstruct`, `BigCodeBenchHardInstruct`, `HumanEval`, `HumanEvalInstruct`, `MBPP`, `MBPP_PROMPT_WITHOUT_TESTS`, `MBPP_SANITIZED`, `MBPP_PROMPT_WITHOUT_TESTS_SANITIZED` | `InfiniteBench_CodeRun` |
+| Logical Reasoning | Puzzle | `SPHYR` | |
+| Output Control | Structure | `IFEval`, `IFEvalDe`, `IFEvalFiSv`, `RenderableStructEval`, `StructEval` | |
+| Text Distillation | Aggregation | | `ZERO_SCROLLS_SPACE_DIGEST` |
+| Text Distillation | Classification | `GPQA_COT`, `MMLU`, `MMLU_IDK`, `MMLU_PRO_COT`, `MMMLU_GERMAN_COT`, `PAWSX`, `TRIVIAQA` | |
+| Text Distillation | Closed QA | `SQUAD`, `SQUAD2` | `InfiniteBench_EnDia`, `InfiniteBench_EnQA` |
+| Text Distillation | Extraction | `DUC_ABSTRACTIVE`, `DUC_EXTRACTIVE` | `InfiniteBench_RetrieveKV2`, `InfiniteBench_RetrieveNumber`, `InfiniteBench_RetrievePassKey1` |
+| Text Distillation | QA | | `ZERO_SCROLLS_GOV_REPORT`, `ZERO_SCROLLS_MUSIQUE`, `ZERO_SCROLLS_NARRATIVEQA`, `ZERO_SCROLLS_QASPER`, `ZERO_SCROLLS_QMSUM`, `ZERO_SCROLLS_SQUALITY` |
+| Text Transformation | Translation | `Flores200`, `FloresPlus`, `WMT14`, `WMT14_INSTRUCT`, `WMT16`, `WMT16_INSTRUCT`, `WMT20`, `WMT20_INSTRUCT` | |
+
+## Loglikelihoods
+
+| **Task** | **Capability** | **Benchmarks** | **Long Context** |
+| --- | --- | --- | --- |
+| Output Control | Bias | `WINOGENDER`, `WINOGENDER_IDK` | |
+| Text Distillation | Classification | `ARC`, `ARC_DE`, `ARC_EU20_DE`, `ARC_EU20_FR`, `ARC_FI`, `ARC_IDK`, `BELEBELE`, `ChemBench`, `FullTextMMLU`, `GPQA`, `GPQA_IDK`, `INCLUDE`, `MMLU`, `MMLU_DE`, `MMLU_EU20_DE`, `MMLU_EU20_FR`, `MMMLU`, `MMLU_PRO`, `MMLU_PRO_IDK`, `OPENBOOKQA`, `OPENBOOKQA_IDK`, `PIQA`, `PIQA_IDK`, `SCIQ`, `SCIQEvalHarness`, `TRUTHFULQA`, `TRUTHFULQA_EU20_DE`, `TRUTHFULQA_EU20_FR`, `TRUTHFULQA_IDK` | |
+| Text Distillation | QA | | `QUALITY`, `ZERO_SCROLLS_QUALITY` |
+| Text Generation | Open QA | `CASEHOLD` | |
+| Logical Reasoning | Closed QA | | `InfiniteBench_EnMC` |
+| Logical Reasoning | Programming | | `InfiniteBench_CodeDebug` |
+| Logical Reasoning | Reasoning | `BalancedCOPA`, `COPA`, `COPAEvalHarness`, `COPA_IDK`, `COPA_IDKEvalHarness`, `GOLDENSWAG`, `GOLDENSWAG_IDK`, `HELLASWAG`, `HELLASWAG_DE`, `HELLASWAG_EU20_DE`, `HELLASWAG_EU20_FR`, `HELLASWAG_IDK`, `WINOGRANDE`, `WINOGRANDE_IDK`, `WINOX_DE`, `WINOX_FR` | |
+
+## Long-Context
+
+| Task Name | Tag | Task | Capability | Domain | Common Few-Shot Counts | Avg #Words | Language |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+| Babilong | `Eval Suite Long Context` | Text Generation, Long Context | Completion, Long Context | ? | not supported | 22003 | en |
+| InfiniteBench_CodeDebug | `InfiniteBench_CodeDebug` | Logical Reasoning | Programming | ? | not supported | 127761 | en |
+| InfiniteBench_CodeRun | `InfiniteBench_CodeRun` | Logical Reasoning | Programming | ? | not supported | 34851 | en |
+| InfiniteBench_EnDia | `InfiniteBench_EnDia` | Text Distillation | Closed QA | ? | not supported | 73240 | en |
+| InfiniteBench_EnMC | `InfiniteBench_EnMC` | Text Distillation | Closed QA | ? | not supported | 139966 | en |
+| InfiniteBench_EnQA | `InfiniteBench_EnQA` | Text Distillation | Closed QA | ? | not supported | 149442 | en |
+| InfiniteBench_MathFind | `InfiniteBench_MathFind` | Logical Reasoning | Math | ? | not supported | 30017 | en |
+| InfiniteBench_RetrieveKV2 | `InfiniteBench_RetrieveKV2` | Text Distillation | Extraction | ? | not supported | 5010 | en |
+| InfiniteBench_RetrieveNumber | `InfiniteBench_RetrieveNumber` | Text Distillation | Extraction | ? | not supported | 99199 | en |
+| InfiniteBench_RetrievePassKey1 | `InfiniteBench_RetrievePassKey1` | Text Distillation | Extraction | ? | not supported | 99196 | en |
+| QuALITY | `QuALITY` | Text Distillation | QA | Literature, Misc | not supported | 4248 | en |
+| ZeroSCROLLS GovReport | `ZeroSCROLLS GovReport` | Text Distillation | QA | Government | not supported | 7273 | en |
+| ZeroSCROLLS SQuALITY | `ZeroSCROLLS SQuALITY` | Text Distillation | QB-Summ? | Literature | not supported | 4971 | en |
+| ZeroSCROLLS Qasper | `ZeroSCROLLS Qasper` | Text Distillation | QA | Science | not supported | 3531 | en |
+| ZeroSCROLLS NarrativeQA | `ZeroSCROLLS NarrativeQA` | Text Distillation | QA | Literature, Film | not supported | 49384 | en |
+| ZeroSCROLLS QuALITY | `ZeroSCROLLS QuALITY` | Text Distillation | QA | Literature, Misc | not supported | 4248 | en |
+| ZeroSCROLLS MuSiQue | `ZeroSCROLLS MuSiQue` | Text Distillation | QA | Wikipedia | not supported | 1749 | en |
+| ZeroSCROLLS SpaceDigest | `ZeroSCROLLS SpaceDigest` | Text Distillation | Aggregation | Reviews | not supported | 5481 | en |
 
 ## Languages
 
@@ -61,6 +111,47 @@ Additional task documentation can be generated with the script `utils/generate-t
 
 ## Metrics
 
+| Metrics Type | Metrics |
+| --- | --- |
+| Completion Metrics | Accuracy |
+| | Bleu |
+| | Chrf |
+| | Ter |
+| | F1 |
+| | Rouge 1 |
+| | Rouge 2 |
+| | Rouge-L |
+| | Code Assertion |
+| | Language Checker |
+| | Length Checker |
+| | Math Reasoning |
+| | Placeholder Checker |
+| | Text Counter |
+| | CSV Format |
+| | JSON Format |
+| | Postscript Format |
+| | Custom IFEval Checker |
+| | Custom CWE Checker |
+| | Custom NIAH Checker |
+| | Custom Grid Comparison Checker |
+| | Repetition Checker |
+| Loglikelihood Metrics | Accuracy Loglikelihood |
+| | Accuracy Normalized Loglikelihood |
+| | Confidence-weighted Accuracy |
+| | Probability Mass |
+| | Probability Mass Normalized |
+| (IDK-Specific Metrics) | Distributional Correctness Score |
+| | Ternary Score |
+| LLM Metrics | Chatbot Style Judge |
+| | Completion Accuracy Judge |
+| | Conciseness Judge |
+| | Contains Names Judge |
+| | Instruction Judge |
+| | SQL Format |
+| | World Knowledge Judge |
+| Efficiency Metrics | Bytes per Sequence Position |