I have also tested Ultravox-v0.5-LLaMA-3.1-8B, but my results are slightly different from yours, especially on the SD-QA dataset.

| | AlpacaEval | CommonEval | SD-QA | OpenBookQA | IFEval | AdvBench |
|---|---|---|---|---|---|---|
| | Open-Ended QA | Open-Ended QA | Reference-Based QA | Multiple-Choice QA | Instruction Following | Safety |
| samples | 199 | 200 | 553 | 455 | 345 | 520 |
| Ultravox0.5 LLama3.1 8B Instruct | 4.75 | 4.08 | 72.42 | 69.01 | 68.05 | 98.84 |
