Releases: Aleph-Alpha-Research/eval-framework
0.2.7 (2026-01-08)
Features
- add position randomization for LLM pairwise judges (#135) (e4ed3ec)
- added automated documentation through CI and Sphinx (#127) (46ef6b3)
- added badges to GitHub README to link PyPI and docs pages (#139) (778bad2)
- pass AA_TOKEN and AA_INFERENCE_ENDPOINT in the AA model constructor (#134) (93267b6)
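The position randomization entry above refers to a common way of reducing position bias in pairwise LLM judging: the two candidate answers are shown to the judge in random order, and the verdict is mapped back to the original order. The following is a minimal sketch of that general idea, not the framework's actual implementation; `judge_with_random_position`, `PairwiseVerdict`, and the `judge` callable returning "first", "second", or "tie" are all hypothetical names chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class PairwiseVerdict:
    # "a", "b", or "tie", always expressed in the ORIGINAL (unswapped) order.
    winner: str


def judge_with_random_position(answer_a, answer_b, judge, rng):
    """Show the two answers to `judge` in random order and un-swap the verdict.

    `judge` is a hypothetical callable taking (first, second) and returning
    "first", "second", or "tie". `rng` is a random.Random instance so the
    ordering is reproducible.
    """
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)

    raw = judge(first, second)
    if raw == "tie":
        return PairwiseVerdict("tie")
    if raw not in ("first", "second"):
        raise ValueError(f"unexpected judge output: {raw!r}")

    # When the order was swapped, the answer in first position is answer_b.
    first_won = raw == "first"
    winner = "b" if first_won == swapped else "a"
    return PairwiseVerdict(winner)
```

With a judge that always prefers whatever it sees first, the reported winner now alternates between "a" and "b" across samples instead of always favoring one side, so the bias averages out over a dataset rather than systematically rewarding one model.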
Bug Fixes
- docs: resolve broken source links (#132) (c0e37b2)
- release-please pushes docker to registry and triggers tests (#138) (d291bb4)
Documentation
0.2.6
What's Changed
- chore: Update uv and uv_build backend to version >= 0.9 by @wsascha in #126
- fix: prevent leaking timeout errors by @MaxHam in #130
- chore: version update and CHANGELOG by @jordisassoon in #131
New Contributors
Full Changelog: v0.2.5...v0.2.6
0.2.5
What's Changed
- change qwen3 8b to qwen3 0.6b by @AhmedHammam-AA in #117
- IDK tasks by @tfburns in #102
- Fix typo in prompt template for German MT Bench by @tfburns in #109
- optimize build ci job by @AhmedHammam-AA in #118
- feat: switch logprobs to completions endpoint for aa client by @MaxMeuer in #120
- fix: updated image urls to be absolute for pypi by @jordisassoon in #121
- enable parallel test execution for test-cpu by @AhmedHammam-AA in #107
- loose ends on API models by @GrS-AA in #111
- bump version for release 0.2.5 by @jordisassoon in #124
New Contributors
Full Changelog: v0.2.4...v0.2.5
0.2.4
What's Changed
- fix tasks docs and squad / improvements of OpenAI API models by @GrS-AA in #97
- added query template to hendrycks math task class by @jordisassoon in #112
- Verbosity flag controls output for minimal/maximal cli output by @jordisassoon in #110
- Add AidanBench by @JohannesMessnerAA in #113
- adding openai as extra in hf cache CI pipeline by @jordisassoon in #115
- bump version for release 0.2.4 by @jordisassoon in #114
- correct typo in prompt and add AidanBenchOriginal by @JohannesMessnerAA in #116
- hotfix by @jordisassoon in #119
New Contributors
- @JohannesMessnerAA made their first contribution in #113
Full Changelog: v0.2.3...v0.2.4
0.2.3
What's Changed
- Post 0.2.2 by @Michael-JB in #84
- Use artifacts from WANDB_ADDITIONAL_ARTIFACT_REFERENCES by @mys007 in #79
- GPU resource share by @okmaar in #82
- Fix #79 by @mys007 in #86
- SciQ, TruthfulQA fix and GSM8K renaming by @kcoost in #81
- Add WandbUploader and refactored HFUploader by @mys007 in #87
- Reactivate sphyr formatter test by @mys007 in #88
- Fix missing imports for pip install eval_framework[transformers] by @jordisassoon in #91
- Feature/add model specific post processing of completion by @FelixReinfurtAA in #92
- (MTBench) fixing exception raising; key error will never raise by @dylan-rodriquez in #93
- Add three new loglikelihood metrics by @tfburns in #85
- Updating README and other docs to ensure correct example usage. by @jordisassoon in #89
- Refactor vllm+hfllm interfaces & improve wandb by @mys007 in #90
- Add exact_match check to json_format metric by @tillspeicher in #96
- Bugfix: wandb-uploader: Add length limit to artifact name by @mys007 in #98
- add generate_from_samples function and codepath by @CoEich in #100
- fixed pip install eval_framework[all] by @jordisassoon in #95
- OpenBookQA is now really openbook by @jordisassoon in #104
- Wandb uploader: fix ARTIFACT_NAME_MAXLEN by @mys007 in #106
- Maximum generated tokens for the same benchmark varies per tokenizer fertility by @jordisassoon in #105
- add docker images for each release by @AhmedHammam-AA in #103
- restructuring tests by @AhmedHammam-AA in #101
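The entry above on maximum generated tokens varying with tokenizer fertility can be illustrated with a small sketch. It assumes that "fertility" means the average number of tokens a tokenizer produces per word, and that the generation budget is scaled proportionally; both the function name `scaled_max_tokens` and the proportional rule are assumptions made for illustration, not taken from the framework.

```python
import math


def scaled_max_tokens(base_max_tokens, tokenizer_fertility, reference_fertility=1.0):
    """Scale a token budget so it covers roughly the same amount of text.

    Hypothetical helper: `base_max_tokens` is the budget defined for a
    tokenizer with `reference_fertility` tokens per word; a tokenizer that
    needs more tokens per word gets a proportionally larger budget.
    """
    if tokenizer_fertility <= 0 or reference_fertility <= 0:
        raise ValueError("fertility values must be positive")
    return math.ceil(base_max_tokens * tokenizer_fertility / reference_fertility)
```

For example, under these assumptions a benchmark budget of 100 tokens becomes 150 tokens for a tokenizer with a fertility of 1.5 relative to a reference of 1.0.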
New Contributors
- @jordisassoon made their first contribution in #91
Full Changelog: v0.2.2...v0.2.3