diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
index 073873e8..3edad436 100644
--- a/.github/pull_request_template.md
+++ b/.github/pull_request_template.md
@@ -10,17 +10,14 @@
 - ✅ Provide tests for your changes.
 - 📝 Use descriptive commit messages.
 - 📗 Update any related documentation and include any relevant screenshots.
-- 📗 Reflect the changes you made in the changelog.
-
 -->
 
 ## PR Checklist
+
 - [ ] Use descriptive commit messages.
 - [ ] Provide tests for your changes.
 - [ ] Update any related documentation and include any relevant screenshots.
 - [ ] Check if changes need to be made to docs (README or any guides in `/docs/`).
-- [ ] Reflect the changes you made in the changelog.
 
 ## What type of PR is this? (check all applicable)
@@ -52,8 +49,8 @@
 _Please replace this line with instructions on how to test your changes, a note on the hardware and config this has been tested on, as well as any relevant additional information._
-
 ## Added/updated tests?
+
 - [ ] Yes
 - [ ] No, and this is why: _please replace this line with details on why tests have not been included_
diff --git a/docs/add_new_benchmark_guide.md b/docs/add_new_benchmark_guide.md
index 53875e98..11ed4079 100644
--- a/docs/add_new_benchmark_guide.md
+++ b/docs/add_new_benchmark_guide.md
@@ -5,10 +5,12 @@ This guide provides comprehensive instructions for adding new benchmarks to the
 ## Overview
 
 The eval-framework supports two response types:
+
 1. **Completion Tasks** - Generate text completions (e.g., math problems, code generation)
 2. **Loglikelihood Tasks** - Multiple choice questions where the model ranks answer options
 
 For detailed information about implementing each task type, please refer to:
+
 - [Completion Task Guide](completion_task_guide.md) - Comprehensive guide for text generation tasks
 - [Loglikelihood Task Guide](loglikelihood_task_guide.md) - Detailed guide for multiple choice tasks
@@ -95,7 +97,6 @@ def post_process_generated_completion(self, completion_text: str, sample: Sample
 
 This section provides a complete reference for all configurations available when creating benchmarks.
-
 ### Response Types
 
 The response type determines how your model interacts with the task and what type of output is expected.
@@ -114,7 +115,6 @@ RESPONSE_TYPE = ResponseType.LOGLIKELIHOODS
 
 Metrics define how your task's outputs are evaluated and scored. Choose metrics that align with your response type and evaluation goals.
-
 #### Completion Metrics
 
 These metrics work with generated text outputs from COMPLETION tasks:
@@ -175,7 +175,6 @@ from eval_framework.metrics.loglikelihood.probability_mass import ProbabilityMas
 
 These metrics use another LLM to evaluate generated outputs, useful for complex or subjective tasks:
-
 ```python
 from eval_framework.metrics.llm.llm_judge_chatbot_style import LLMJudgeChatbotStyle  # Classifies whether a text generation model's response follows a chatbot-style format by evaluating characteristics like friendly introductions, verbose language, follow-up questions, and conversational fluff, returning a boolean classification with reasoning. (English and German)
@@ -221,7 +220,6 @@ from eval_framework.metrics.llm.llm_judge_world_knowledge import LLMJudgeWorldKn
 ```
-
 ## Implementation Examples and Patterns
 
 ### Practical Example: GeographyQATask
@@ -267,7 +265,6 @@ class GeographyQATask(BaseTask[str]):
         return self.rnd.sample(self.dataset[self.FEWSHOT_SPLIT], self.num_fewshot)
 ```
-
 ### Add to Task Registry
 
 Add a registration call for your new benchmark to `register_all_tasks` in `src/eval_framework/tasks/task_names.py`:
@@ -280,32 +277,70 @@ The task will now be available through `get_task("GeographyQA")`.
 
 ### Testing your benchmark
 
-All tasks automatically go through formatting tests to ensure proper prompt generation. However, if your benchmark has specific functionality that needs testing, create a dedicated test file.
+All tasks automatically go through formatting tests to ensure proper prompt generation. The formatting test lives in `tests/tests_eval_framework/tasks/test_all_formatters.py` and runs all registered tasks automatically.
+
+> [!TIP]
+> CI runs this test across all registered tasks, but during development you can target only your task, for example:
+> `uv run pytest tests/tests_eval_framework/tasks/test_all_formatters.py -k "YourTaskName"`
 
 #### Automatic Formatting Tests
-All benchmarks are automatically tested for proper prompt formatting across different chat templates. No additional setup required.
+
+All benchmarks are automatically tested for proper prompt formatting across different chat templates. If your new task needs non-default initialization arguments (for example, a specific `num_fewshot`), add an entry for your task to `SPECIAL_ARGS` in `tests/tests_eval_framework/tasks/test_all_formatters.py`.
+
+The expected formatter outputs are tracked as hashes in `tests/tests_eval_framework/tasks/task-prompts-hashes.json`.
+
+When you add a new task:
+
+1. Run the formatter hash test once for your task to generate/check hashes.
+2. If your task hash is new, it will be added to `task-prompts-hashes.json`.
+3. Commit the updated JSON file together with your task changes.
+
+Run the formatter hash test only for your newly created task (replace `YourTaskName`):
+
+```bash
+uv run pytest tests/tests_eval_framework/tasks/test_all_formatters.py -m formatter_hash -k "YourTaskName"
+```
 
 #### Custom Task Tests (Optional)
+
 If your benchmark has specific logic that needs testing, create a dedicated test file in `tests/tasks/`.
 
+### Update benchmark documentation
+
+After adding a benchmark, you also need to update the task documentation:
+
+1. Manually add the new benchmark name(s) to `docs/benchmarks_and_metrics.md` (including `*_IDK` variants if your benchmark has them).
+2. Regenerate the task docs:
+
+```bash
+uv run -m eval_framework.utils.generate_task_docs
+```
+
+This updates `docs/tasks/README.md` and creates per-task documentation files for new tasks in `docs/tasks/`.
+
 ## Benchmark Examples by Task Type
 
 Study these existing benchmarks in the codebase for more complex patterns:
 
 #### Simple Classification Tasks
+
 - **ARC** (`src/eval_framework/tasks/arc.py`): Multiple choice with loglikelihoods
 - **MMLU** (`src/eval_framework/tasks/mmlu.py`): Multi-subject classification with enum subjects
 
 #### Reasoning Tasks
+
 - **GSM8K** (`src/eval_framework/tasks/gsm8k.py`): Math reasoning with answer extraction patterns
 
 #### Code Generation
+
 - **HumanEval** (`src/eval_framework/tasks/human_eval.py`): Code completion with execution validation
 - **MBPP** (`src/eval_framework/tasks/mbpp.py`): Code generation with comprehensive test validation
 
 #### Long Context Tasks
+
 - **InfiniteBench** (`src/eval_framework/tasks/infinite_bench_tasks.py`): Long context reasoning tasks
 
 #### Custom Format Tasks
+
 - **IFEval** (`src/eval_framework/tasks/ifeval.py`): Instruction following with format validation
 - **JSON/CSV Tasks:** Custom format validation examples
diff --git a/docs/benchmarks_and_metrics.md
b/docs/benchmarks_and_metrics.md
index 01900a9e..3face99d 100644
--- a/docs/benchmarks_and_metrics.md
+++ b/docs/benchmarks_and_metrics.md
@@ -6,6 +6,55 @@ Additional task documentation can be generated with the script `utils/generate-t
 
 ## Completion
 
+| **Task** | **Capability** | **Benchmarks** | **Long Context** |
+| --- | --- | --- | --- |
+| Logical Reasoning | Math | `AIME2024`, `AIME2025`, `GSM8K`, `GSM8K_EU20_DE`, `GSM8K_EU20_FR`, `GSM8KEvalHarness`, `GSM8KReasoning`, `MATH`, `MATH500`, `MATH500Minerva`, `MATHLvl5`, `MATHMinerva`, `MATHMinervaBPB`, `MATHMinervaEvalHarness`, `TableBench` | `InfiniteBench_MathFind` |
+| Logical Reasoning | Programming | `BigCodeBench`, `BigCodeBenchHard`, `BigCodeBenchInstruct`, `BigCodeBenchHardInstruct`, `HumanEval`, `HumanEvalInstruct`, `MBPP`, `MBPP_PROMPT_WITHOUT_TESTS`, `MBPP_SANITIZED`, `MBPP_PROMPT_WITHOUT_TESTS_SANITIZED` | `InfiniteBench_CodeRun` |
+| Logical Reasoning | Puzzle | `SPHYR` | |
+| Output Control | Structure | `IFEval`, `IFEvalDe`, `IFEvalFiSv`, `RenderableStructEval`, `StructEval` | |
+| Text Distillation | Aggregation | | `ZERO_SCROLLS_SPACE_DIGEST` |
+| Text Distillation | Classification | `GPQA_COT`, `MMLU`, `MMLU_IDK`, `MMLU_PRO_COT`, `MMMLU_GERMAN_COT`, `PAWSX`, `TRIVIAQA` | |
+| Text Distillation | Closed QA | `SQUAD`, `SQUAD2` | `InfiniteBench_EnDia`, `InfiniteBench_EnQA` |
+| Text Distillation | Extraction | `DUC_ABSTRACTIVE`, `DUC_EXTRACTIVE` | `InfiniteBench_RetrieveKV2`, `InfiniteBench_RetrieveNumber`, `InfiniteBench_RetrievePassKey1` |
+| Text Distillation | QA | | `ZERO_SCROLLS_GOV_REPORT`, `ZERO_SCROLLS_MUSIQUE`, `ZERO_SCROLLS_NARRATIVEQA`, `ZERO_SCROLLS_QASPER`, `ZERO_SCROLLS_QMSUM`, `ZERO_SCROLLS_SQUALITY` |
+| Text Transformation | Translation | `Flores200`, `FloresPlus`, `WMT14`, `WMT14_INSTRUCT`, `WMT16`, `WMT16_INSTRUCT`, `WMT20`, `WMT20_INSTRUCT` | |
+
+## Loglikelihoods
+
+| **Task** | **Capability** | **Benchmarks** | **Long Context** |
+| --- | --- | --- | --- |
+| Output Control | Bias | `WINOGENDER`, `WINOGENDER_IDK` | |
+| Text Distillation | Classification | `ARC`, `ARC_DE`, `ARC_EU20_DE`, `ARC_EU20_FR`, `ARC_FI`, `ARC_IDK`, `BELEBELE`, `ChemBench`, `FullTextMMLU`, `GPQA`, `GPQA_IDK`, `INCLUDE`, `MMLU`, `MMLU_DE`, `MMLU_EU20_DE`, `MMLU_EU20_FR`, `MMMLU`, `MMLU_PRO`, `MMLU_PRO_IDK`, `OPENBOOKQA`, `OPENBOOKQA_IDK`, `PIQA`, `PIQA_IDK`, `SCIQ`, `SCIQEvalHarness`, `TRUTHFULQA`, `TRUTHFULQA_EU20_DE`, `TRUTHFULQA_EU20_FR`, `TRUTHFULQA_IDK` | |
+| Text Distillation | QA | | `QUALITY`, `ZERO_SCROLLS_QUALITY` |
+| Text Generation | Open QA | `CASEHOLD` | |
+| Logical Reasoning | Closed QA | | `InfiniteBench_EnMC` |
+| Logical Reasoning | Programming | | `InfiniteBench_CodeDebug` |
+| Logical Reasoning | Reasoning | `BalancedCOPA`, `COPA`, `COPAEvalHarness`, `COPA_IDK`, `COPA_IDKEvalHarness`, `GOLDENSWAG`, `GOLDENSWAG_IDK`, `HELLASWAG`, `HELLASWAG_DE`, `HELLASWAG_EU20_DE`, `HELLASWAG_EU20_FR`, `HELLASWAG_IDK`, `WINOGRANDE`, `WINOGRANDE_IDK`, `WINOX_DE`, `WINOX_FR` | |
+
+## Long-Context
+
+| Task Name | Tag | Task | Capability | Domain | Common Few-Shot Counts | Avg #Words | Language |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+| Babilong | `Eval Suite Long Context` | Text Generation, Long Context | Completion, Long Context | ? | not supported | 22003 | en |
+| InfiniteBench_CodeDebug | `InfiniteBench_CodeDebug` | Logical Reasoning | Programming | ? | not supported | 127761 | en |
+| InfiniteBench_CodeRun | `InfiniteBench_CodeRun` | Logical Reasoning | Programming | ? | not supported | 34851 | en |
+| InfiniteBench_EnDia | `InfiniteBench_EnDia` | Text Distillation | Closed QA | ? | not supported | 73240 | en |
+| InfiniteBench_EnMC | `InfiniteBench_EnMC` | Text Distillation | Closed QA | ? | not supported | 139966 | en |
+| InfiniteBench_EnQA | `InfiniteBench_EnQA` | Text Distillation | Closed QA | ? | not supported | 149442 | en |
+| InfiniteBench_MathFind | `InfiniteBench_MathFind` | Logical Reasoning | Math | ? | not supported | 30017 | en |
+| InfiniteBench_RetrieveKV2 | `InfiniteBench_RetrieveKV2` | Text Distillation | Extraction | ? | not supported | 5010 | en |
+| InfiniteBench_RetrieveNumber | `InfiniteBench_RetrieveNumber` | Text Distillation | Extraction | ? | not supported | 99199 | en |
+| InfiniteBench_RetrievePassKey1 | `InfiniteBench_RetrievePassKey1` | Text Distillation | Extraction | ? | not supported | 99196 | en |
+| QuALITY | `QuALITY` | Text Distillation | QA | Literature, Misc | not supported | 4248 | en |
+| ZeroSCROLLS GovReport | `ZeroSCROLLS GovReport` | Text Distillation | QA | Government | not supported | 7273 | en |
+| ZeroSCROLLS SQuALITY | `ZeroSCROLLS SQuALITY` | Text Distillation | QB-Summ? | Literature | not supported | 4971 | en |
+| ZeroSCROLLS Qasper | `ZeroSCROLLS Qasper` | Text Distillation | QA | Science | not supported | 3531 | en |
+| ZeroSCROLLS NarrativeQA | `ZeroSCROLLS NarrativeQA` | Text Distillation | QA | Literature, Film | not supported | 49384 | en |
+| ZeroSCROLLS QuALITY | `ZeroSCROLLS QuALITY` | Text Distillation | QA | Literature, Misc | not supported | 4248 | en |
+| ZeroSCROLLS MuSiQue | `ZeroSCROLLS MuSiQue` | Text Distillation | QA | Wikipedia | not supported | 1749 | en |
+| ZeroSCROLLS SpaceDigest | `ZeroSCROLLS SpaceDigest` | Text Distillation | Aggregation | Reviews | not supported | 5481 | en |
 
 ## Languages
 
@@ -61,6 +111,47 @@ Additional task documentation can be generated with the script `utils/generate-t
 
 ## Metrics
 
+| Metrics Type | Metrics |
+| --- | --- |
+| Completion Metrics | Accuracy |
+| | Bleu |
+| | Chrf |
+| | Ter |
+| | F1 |
+| | Rouge 1 |
+| | Rouge 2 |
+| | Rouge-L |
+| | Code Assertion |
+| | Language Checker |
+| | Length Checker |
+| | Math Reasoning |
+| | Placeholder Checker |
+| | Text Counter |
+| | CSV Format |
+| | JSON Format |
+| | Postscript Format |
+| | Custom IFEval Checker |
+| | Custom CWE Checker |
+| | Custom NIAH Checker |
+| | Custom Grid Comparison Checker |
+| | Repetition Checker |
+| Loglikelihood Metrics | Accuracy Loglikelihood |
+| | Accuracy Normalized Loglikelihood |
+| | Confidence-weighted Accuracy |
+| | Probability Mass |
+| | Probability Mass Normalized |
+| (IDK-Specific Metrics) | Distributional Correctness Score |
+| | Ternary Score |
+| LLM Metrics | Chatbot Style Judge |
+| | Completion Accuracy Judge |
+| | Conciseness Judge |
+| | Contains Names Judge |
+| | Instruction Judge |
+| | SQL Format |
+| | World Knowledge Judge |
+| Efficiency Metrics | Bytes per Sequence Position |