7 changes: 2 additions & 5 deletions .github/pull_request_template.md
@@ -10,17 +10,14 @@
- ✅ Provide tests for your changes.
- 📝 Use descriptive commit messages.
- 📗 Update any related documentation and include any relevant screenshots.
- 📗 Reflect the changes you made in the changelog.

-->


## PR Checklist

- [ ] Use descriptive commit messages.
- [ ] Provide tests for your changes.
- [ ] Update any related documentation and include any relevant screenshots.
- [ ] Check if changes need to be made to docs (README or any guides in `/docs/`).
- [ ] Reflect the changes you made in the changelog.

## What type of PR is this? (check all applicable)

@@ -52,8 +49,8 @@ _Please replace this line with instructions on how to test your changes, a note
on the hardware and config this has been tested on, as well as any relevant
additional information._


## Added/updated tests?

- [ ] Yes
- [ ] No, and this is why: _please replace this line with details on why tests
have not been included_
49 changes: 42 additions & 7 deletions docs/add_new_benchmark_guide.md
@@ -5,10 +5,12 @@ This guide provides comprehensive instructions for adding new benchmarks to the
## Overview

The eval-framework supports two response types:

1. **Completion Tasks** - Generate text completions (e.g., math problems, code generation)
2. **Loglikelihood Tasks** - Multiple choice questions where the model ranks answer options

For detailed information about implementing each task type, please refer to:

- [Completion Task Guide](completion_task_guide.md) - Comprehensive guide for text generation tasks
- [Loglikelihood Task Guide](loglikelihood_task_guide.md) - Detailed guide for multiple choice tasks
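As a rough sketch, the distinction between the two response types can be pictured like this (the enum values and task classes below are illustrative, not necessarily the framework's exact definitions):

```python
from enum import Enum

# Illustrative sketch of the two response types described above;
# the real ResponseType enum in eval-framework may differ in detail.
class ResponseType(Enum):
    COMPLETION = "completion"          # model generates free text
    LOGLIKELIHOODS = "loglikelihoods"  # model scores fixed answer options

class MathWordProblemTask:
    # A math task generates text, so it would declare:
    RESPONSE_TYPE = ResponseType.COMPLETION

class MultipleChoiceScienceTask:
    # A multiple-choice task ranks options by log-likelihood:
    RESPONSE_TYPE = ResponseType.LOGLIKELIHOODS
```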

@@ -95,7 +97,6 @@ def post_process_generated_completion(self, completion_text: str, sample: Sample

This section provides a complete reference for all configurations available when creating benchmarks.


### Response Types

The response type determines how your model interacts with the task and what type of output is expected.
@@ -114,7 +115,6 @@ RESPONSE_TYPE = ResponseType.LOGLIKELIHOODS

Metrics define how your task's outputs are evaluated and scored. Choose metrics that align with your response type and evaluation goals.


#### Completion Metrics

These metrics work with generated text outputs from COMPLETION tasks:
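To make the idea concrete: a completion metric is essentially a function from a generated string and a reference to a score. A minimal exact-match sketch (hand-rolled, not the framework's actual metric class):

```python
# Hand-rolled sketch of an exact-match completion metric; the framework's
# metric classes wrap the same idea with richer plumbing (batching, config).
def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if prediction equals the reference after normalization."""
    def normalize(s: str) -> str:
        # Lowercase and collapse whitespace before comparing.
        return " ".join(s.lower().split())
    return float(normalize(prediction) == normalize(reference))
```

For example, `exact_match("  Paris ", "paris")` scores 1.0 despite the casing and whitespace differences.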
@@ -175,7 +175,6 @@ from eval_framework.metrics.loglikelihood.probability_mass import ProbabilityMas

These metrics use another LLM to evaluate generated outputs, useful for complex or subjective tasks:


```python
from eval_framework.metrics.llm.llm_judge_chatbot_style import LLMJudgeChatbotStyle
# Classifies whether a text generation model's response follows a chatbot-style format by evaluating characteristics like friendly introductions, verbose language, follow-up questions, and conversational fluff, returning a boolean classification with reasoning. (English and German)
@@ -221,7 +220,6 @@ from eval_framework.metrics.llm.llm_judge_world_knowledge import LLMJudgeWorldKn

```


## Implementation Examples and Patterns

### Practical Example: GeographyQATask
@@ -267,7 +265,6 @@ class GeographyQATask(BaseTask[str]):
return self.rnd.sample(self.dataset[self.FEWSHOT_SPLIT], self.num_fewshot)
```
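The few-shot sampling at the end of the excerpt is plain `random.Random.sample` over a dataset split; in isolation the mechanism looks like this (the dataset contents here are made up):

```python
import random

# Stand-ins for the task's seeded RNG and loaded dataset splits.
rnd = random.Random(42)  # seeding keeps few-shot selection reproducible
dataset = {"train": [{"question": f"Q{i}", "answer": f"A{i}"} for i in range(10)]}

# Equivalent of self.rnd.sample(self.dataset[self.FEWSHOT_SPLIT], self.num_fewshot):
num_fewshot = 3
fewshot_examples = rnd.sample(dataset["train"], num_fewshot)
assert len(fewshot_examples) == num_fewshot
```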


### Add to Task Registry

Add a registration call for your new benchmark to `register_all_tasks` in `src/eval_framework/tasks/task_names.py`:
@@ -280,32 +277,70 @@ The task will now be available through `get_task("GeographyQA")`.
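The registry mechanism itself is simple; a generic sketch of the lookup pattern (the function and variable names below are illustrative, not the framework's exact internals):

```python
# Minimal name-to-class registry, analogous in spirit to
# register_all_tasks/get_task; the real implementation may differ.
_TASK_REGISTRY: dict[str, type] = {}

def register_task(name: str, task_cls: type) -> None:
    _TASK_REGISTRY[name] = task_cls

def get_task(name: str) -> type:
    try:
        return _TASK_REGISTRY[name]
    except KeyError:
        raise KeyError(
            f"Unknown task {name!r}; was it added to register_all_tasks?"
        ) from None

class GeographyQATask:  # stand-in for the class defined earlier
    NAME = "GeographyQA"

register_task(GeographyQATask.NAME, GeographyQATask)
assert get_task("GeographyQA") is GeographyQATask
```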

### Testing your benchmark

Collaborator: Maybe make a note here that it can be run for individual tasks.

Contributor Author: Addressed by 5df2fd5. @tfburns could you resolve it, if it is satisfactory?

All tasks automatically go through formatting tests to ensure proper prompt generation. The formatting test lives in `tests/tests_eval_framework/tasks/test_all_formatters.py` and runs all registered tasks automatically.

> [!TIP]
> CI runs this test across all registered tasks, but during development you can target only your task, for example:
> `uv run pytest tests/tests_eval_framework/tasks/test_all_formatters.py -k "YourTaskName"`

#### Automatic Formatting Tests

All benchmarks are automatically tested for proper prompt formatting across different chat templates. If your new task needs non-default initialization arguments (for example, a specific `num_fewshot`), add an entry for your task to `SPECIAL_ARGS` in `tests/tests_eval_framework/tasks/test_all_formatters.py`.

The expected formatter outputs are tracked as hashes in `tests/tests_eval_framework/tasks/task-prompts-hashes.json`.
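The hash-tracking approach boils down to computing a deterministic digest over the formatted prompts and comparing it against the stored value; a generic sketch (not the framework's exact hashing scheme):

```python
import hashlib
import json

def prompts_digest(prompts: list[str]) -> str:
    """Stable hex digest over a list of formatted prompts (illustrative)."""
    # Serializing the list deterministically makes the digest reproducible.
    payload = json.dumps(prompts, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

stored = prompts_digest(["Q: What is the capital of France?\nA:"])
# A later run regenerates the prompts and compares digests:
assert prompts_digest(["Q: What is the capital of France?\nA:"]) == stored
```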

When you add a new task:

1. Run the formatter hash test once for your task to generate/check hashes.
2. If your task hash is new, it will be added to `task-prompts-hashes.json`.
3. Commit the updated JSON file together with your task changes.

Run the formatter hash test only for your newly created task (replace `YourTaskName`):

```bash
uv run pytest tests/tests_eval_framework/tasks/test_all_formatters.py -m formatter_hash -k "YourTaskName"
```

#### Custom Task Tests (Optional)

If your benchmark has specific logic that needs testing, create a dedicated test file in `tests/tasks/`.
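For example, a dedicated test might pin down task-specific post-processing. The file name and helper below are hypothetical:

```python
# tests/tasks/test_geography_qa.py (hypothetical example)

def post_process(completion: str) -> str:
    """Example task logic under test: first non-empty line is the answer."""
    stripped = completion.strip()
    return stripped.splitlines()[0] if stripped else ""

def test_post_process_takes_first_line() -> None:
    assert post_process("  Paris\nBecause it is the capital.\n") == "Paris"
    assert post_process("") == ""

test_post_process_takes_first_line()
```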

### Update benchmark documentation

After adding a benchmark, you also need to update task documentation:

1. Manually add the new benchmark name(s) to `docs/benchmarks_and_metrics.md` (including `*_IDK` variants if your benchmark has them).
2. Regenerate the task docs:

```bash
uv run -m eval_framework.utils.generate_task_docs
```

This updates `docs/tasks/README.md` and creates per-task documentation files for new tasks in `docs/tasks/`.

## Benchmark Examples by Task Type

Study these existing benchmarks in the codebase for more complex patterns:

#### Simple Classification Tasks

- **ARC** (`src/eval_framework/tasks/arc.py`): Multiple choice with loglikelihoods
- **MMLU** (`src/eval_framework/tasks/mmlu.py`): Multi-subject classification with enum subjects

#### Reasoning Tasks

- **GSM8K** (`src/eval_framework/tasks/gsm8k.py`): Math reasoning with answer extraction patterns

#### Code Generation

- **HumanEval** (`src/eval_framework/tasks/human_eval.py`): Code completion with execution validation
- **MBPP** (`src/eval_framework/tasks/mbpp.py`): Code generation with comprehensive test validation

#### Long Context Tasks

- **InfiniteBench** (`src/eval_framework/tasks/infinite_bench_tasks.py`): Long context reasoning tasks

#### Custom Format Tasks

- **IFEval** (`src/eval_framework/tasks/ifeval.py`): Instruction following with format validation
- **JSON/CSV Tasks:** Custom format validation examples