Releases · Aleph-Alpha-Research/eval-framework

04 Feb 13:13

github-actions

v0.2.12

5983e24

v0.2.12 Latest

0.2.12 (2026-02-04)

Features

add "top_p" param to AlephAlphaAPIModel (#168) (e52c927)
Bump datasets to >=4.0.0 and remove all trust_remote_code references. (#158) (c383806)
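
For readers unfamiliar with the "top_p" parameter named above, the following is a minimal, illustrative sketch of nucleus (top-p) filtering over a toy token distribution. It is not code from eval-framework or from the Aleph Alpha client; in practice the sampling happens on the inference side, and the function name `top_p_filter` and the toy probabilities are assumptions made only for this example.

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, then renormalize the kept probabilities.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")

    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete

    # Renormalize so the kept probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy distribution (assumed values): with top_p=0.7 only "a" and "b" survive.
nucleus = top_p_filter({"a": 0.5, "b": 0.3, "c": 0.2}, top_p=0.7)
```

Lower values of `top_p` restrict sampling to fewer high-probability tokens; a value of 1.0 leaves the distribution unchanged.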


30 Jan 13:30

github-actions

v0.2.11

5244b02


0.2.11 (2026-01-30)

Bug Fixes

Downloaded W&B artifacts are deleted too early (#163) (157d757)
use aleph-alpha-client concurrency limit and allow >100 concurrent requests (#166) (73b7d97)
VLLM tokenizer lazy initialization didn't work with W&B (#165) (f38de79)


27 Jan 13:22

github-actions

v0.2.10

abc4aa6


0.2.10 (2026-01-27)

Bug Fixes

prefix dataset paths with hf user id for all tasks that did not have it before (#160) (d5dc178)


15 Jan 13:28

github-actions

v0.2.9

7456916


0.2.9 (2026-01-15)

Features

add repeats to eval-config (#150) (cb9f860)
add AIME25 benchmark task (#152) (3ef01fc)

Bug Fixes

docker push on release has one too many 'v's in the tag name (#153) (99e6096)


09 Jan 15:44

github-actions

v0.2.8

c67338c


0.2.8 (2026-01-09)

Bug Fixes

normalize math reasoning (#148) (73a8843)
removed github token from release-please and update image links (#147) (74d59ea)


08 Jan 12:41

github-actions

v0.2.7

635d208


0.2.7 (2026-01-08)

Features

add position randomization for LLM pairwise judges (#135) (e4ed3ec)
added automated documentation through CI and Sphinx (#127) (46ef6b3)
added badges to github readme to link pypi and docs pages (#139) (778bad2)
pass AA_TOKEN and AA_INFERENCE_ENDPOINT in the AA model constructor (#134) (93267b6)
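
As background on the position randomization entry above: pairwise LLM judges can favor whichever answer is shown first, so a common mitigation is to shuffle the presentation order and map the verdict back afterwards. The sketch below illustrates that general idea only; it is not the implementation from #135, and the function names and the "first"/"second"/"tie" verdict labels are assumptions made for this example.

```python
import random


def randomize_positions(answer_a: str, answer_b: str, rng: random.Random):
    """Return the two answers in a random presentation order.

    `swapped` records whether answer B is shown in the first slot, so the
    judge's verdict can be mapped back later. Hypothetical helper.
    """
    swapped = rng.random() < 0.5
    if swapped:
        return answer_b, answer_a, swapped
    return answer_a, answer_b, swapped


def map_verdict_back(verdict: str, swapped: bool) -> str:
    """Translate a verdict on the presented order ("first", "second",
    or "tie") into a verdict on the original answers ("A", "B", "tie")."""
    if verdict == "tie":
        return "tie"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    # If the order was swapped, the first slot holds answer B.
    if swapped:
        return "B" if picked_first else "A"
    return "A" if picked_first else "B"


# Usage: shuffle, show (first, second) to the judge, then map the result back.
rng = random.Random(0)  # seeded only to make the example repeatable
first, second, swapped = randomize_positions("answer from model A", "answer from model B", rng)
```

Averaged over many comparisons, randomizing the order this way spreads any first-position preference evenly across both models instead of systematically favoring one.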

Bug Fixes

docs: resolve broken source links (#132) (c0e37b2)
release-please pushes docker to registry and triggers tests (#138) (d291bb4)

Documentation

added documentation for running tests and expected runtimes (#133) (77fd1d3)


15 Dec 10:55

jordisassoon

v0.2.6

d7958ba

0.2.6

What's Changed

chore: Update uv and uv_build backend to version >= 0.9 by @wsascha in #126
fix: prevent leaking timeout errors by @MaxHam in #130
chore: version update and CHANGELOG by @jordisassoon in #131

New Contributors

@wsascha made their first contribution in #126
@MaxHam made their first contribution in #130

Full Changelog: v0.2.5...v0.2.6

Contributors

wsascha, jordisassoon, and MaxHam


08 Dec 11:57

jordisassoon

v0.2.5

bfa479c

0.2.5

What's Changed

change qwen3 8b to qwen3 0.6b by @AhmedHammam-AA in #117
IDK tasks by @tfburns in #102
Fix typo in prompt template for German MT Bench by @tfburns in #109
optimize build ci job by @AhmedHammam-AA in #118
feat: switch logprobs to completions endpoint for aa client by @MaxMeuer in #120
fix: updated image urls to be absolute for pypi by @jordisassoon in #121
enable parallel test execution for test-cpu by @AhmedHammam-AA in #107
loose ends on API models by @GrS-AA in #111
bump version for release 0.2.5 by @jordisassoon in #124

New Contributors

@MaxMeuer made their first contribution in #120

Full Changelog: v0.2.4...v0.2.5

Contributors

tfburns, jordisassoon, and 3 other contributors


26 Nov 14:46

jordisassoon

v0.2.4

91888b5

0.2.4

What's Changed

fix tasks docs and squad / improvements of OpenAI API models by @GrS-AA in #97
added query template to hendrycks math task class by @jordisassoon in #112
Verbosity flag controls output for minimal/maximal cli output by @jordisassoon in #110
Add AidanBench by @JohannesMessnerAA in #113
adding openai as extra in hf cache CI pipeline by @jordisassoon in #115
bump version for release 0.2.4 by @jordisassoon in #114
correct typo in prompt and add AidanBenchOriginal by @JohannesMessnerAA in #116
hotfix by @jordisassoon in #119

New Contributors

@JohannesMessnerAA made their first contribution in #113

Full Changelog: v0.2.3...v0.2.4

Contributors

jordisassoon, GrS-AA, and JohannesMessnerAA


14 Nov 09:37

AhmedHammam-AA

v0.2.3

14fff42

0.2.3

What's Changed

Post 0.2.2 by @Michael-JB in #84
Use artifacts from WANDB_ADDITIONAL_ARTIFACT_REFERENCES by @mys007 in #79
GPU resource share by @okmaar in #82
Fix #79 by @mys007 in #86
SciQ, TruthfulQA fix and GSM8K renaming by @kcoost in #81
Add WandbUploader and refactored HFUploader by @mys007 in #87
Reactivate sphyr formatter test by @mys007 in #88
Fix missing imports for pip install eval_framework[transformers] by @jordisassoon in #91
Feature/add model specific post processing of completion by @FelixReinfurtAA in #92
(MTBench) fixing exception raising; key error will never raise by @dylan-rodriquez in #93
Add three new loglikelihood metrics by @tfburns in #85
Updating README and other docs to ensure correct example usage. by @jordisassoon in #89
Refactor vllm+hfllm interfaces & improve wandb by @mys007 in #90
Add exact_match check to json_format metric by @tillspeicher in #96
Bugfix: wandb-uploader: Add length limit to artifact name by @mys007 in #98
add generate_from_samples function and codepath by @CoEich in #100
fixed pip install eval_framework[all] by @jordisassoon in #95
OpenBookQA is now really openbook by @jordisassoon in #104
Wandb uploader: fix ARTIFACT_NAME_MAXLEN by @mys007 in #106
Maximum generated tokens for the same benchmark varies per tokenizer fertility by @jordisassoon in #105
add docker images for each release by @AhmedHammam-AA in #103
restructuring tests by @AhmedHammam-AA in #101

New Contributors

@jordisassoon made their first contribution in #91

Full Changelog: v0.2.2...v0.2.3

Contributors

tillspeicher, mys007, and 9 other contributors

