Aleph Alpha Eval-Framework

Comprehensive LLM evaluation at scale: a production-ready framework for evaluating large language models across 90+ benchmarks.

Why Choose This Framework?

  • Scalability: Built for distributed evaluation; it currently provides an integration with Determined AI.
  • Extensibility: Easily add custom models, benchmarks, and metrics with object-oriented base classes.
  • Comprehensive: Comes pre-loaded with over 90 tasks covering a broad and diverse range, from reasoning and coding to safety and long-context. Also comes with a comprehensive set of metrics, including LLM-as-a-judge evaluations.

Other features

  • Flexible Model Integration: Supports models loaded via HuggingFace Transformers or custom implementations using the BaseLLM class.
  • Custom Benchmarks: Easily add new benchmarks with minimal code using the BaseTask class.
  • Custom Metrics: Easily define new metrics using the BaseMetric class.
  • Perturbation Testing: Robustness analysis with configurable perturbation types and probabilities.
  • Rich Outputs: Generates JSON results, plots, and detailed analysis reports.
  • Statistical Analysis: Includes confidence intervals and significance testing for reliable comparisons.
  • Docker Support: Pre-configured Dockerfiles for local and distributed setups.
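
As an illustration of the perturbation testing idea mentioned above, here is a minimal sketch of word-level perturbation with a configurable type and probability. It is not the framework's implementation: the names `perturb`, `PERTURBATIONS`, `swap_adjacent_chars`, and `uppercase_word` are hypothetical and chosen only for this example.

```python
import random
from typing import Callable, Dict


def swap_adjacent_chars(word: str, rng: random.Random) -> str:
    """Simulate a typo by swapping two adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def uppercase_word(word: str, rng: random.Random) -> str:
    """Upper-case the whole word (rng is unused, kept for a uniform signature)."""
    return word.upper()


# Hypothetical registry of perturbation types; real frameworks may offer others.
PERTURBATIONS: Dict[str, Callable[[str, random.Random], str]] = {
    "typo": swap_adjacent_chars,
    "uppercase": uppercase_word,
}


def perturb(text: str, kind: str, probability: float, seed: int = 0) -> str:
    """Apply the perturbation `kind` to each word independently with `probability`."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    fn = PERTURBATIONS[kind]
    words = text.split()
    return " ".join(fn(w, rng) if rng.random() < probability else w for w in words)


if __name__ == "__main__":
    prompt = "Natalia sold clips to 48 of her friends in April"
    print(perturb(prompt, "typo", probability=0.3, seed=42))
```

Comparing a model's score on the original prompts with its score on perturbed prompts gives a simple robustness estimate; seeding the generator keeps the comparison reproducible.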

Quick Start

The codebase is tested with Python 3.12 and PyTorch 2.5. For GPU support, you also need a matching CUDA installation on your system. Detailed installation instructions are available in the project documentation.

The easiest way to get started is to install the library via pip and use it as an external dependency.

pip install eval_framework

There are optional extras available to unlock specific features of the library:

  • api for inference using the aleph-alpha client.
  • comet for the COMET metric.
  • determined for running jobs via determined.
  • mistral for inference on Mistral models.
  • transformers for inference using the transformers library.
  • vllm for inference via VLLM.

As a shorthand, the all extra installs all of the above.

For development, you can instead install it directly from the repository. Please install uv first.

To install the project with all optional extras use

uv sync --all-extras

We provide custom groups to control optional extras.

  • flash_attn: Install flash_attn with correct handling of build isolation

Thus, the following sets up the project with flash_attn:

uv sync --all-extras --group flash_attn

To evaluate a single benchmark locally, you can use the following command:

eval_framework \
    --models src/eval_framework/llm/models.py \
    --llm-name Smollm135MInstruct \
    --task-name "GSM8K" \
    --output-dir ./eval \
    --num-fewshot 5 \
    --num-samples 10

For more detailed CLI usage instructions, see the CLI Usage Guide.

Benchmark Coverage & Task Categories

Core Capabilities

Reasoning Knowledge Coding Long Context
MMLU (57 subjects) TriviaQA HumanEval InfiniteBench
SQuAD v1/v2 MBPP
ARC Natural Questions CodeT5 ZeroSCROLLS
HellaSwag QuAC Programming QuALITY
Winogrande COPA Debugging

Languages & Domains

Multilingual           Specialized         Safety & Bias          Efficiency
WMT Translation        Legal (CaseHold)    TruthfulQA             Token counting
FLORES-200             -                   Winogender             Latency metrics
Multilingual MMLU      Medical (MedQA)     Stereotype detection   Memory usage
German/Finnish tasks   Scientific (SciQ)   Harmful content        Cost analysis

Completion

Tasks focused on logical reasoning, text distillation, instruction following, and output control. Examples include:

  • AIME 2024: Logical Reasoning (Math)
  • DUC Abstractive: Text Distillation (Extraction)
  • Custom Data: Complaint Summarization: Text Distillation (Summarization)

Loglikelihoods

Tasks emphasizing classification, reasoning, and open QA. Examples include:

  • Abstract Reasoning Challenge (ARC): Classification
  • Casehold: Open QA
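
To make the loglikelihood task type more concrete, the following sketch shows the general technique: the model scores each candidate answer with a log-likelihood, the highest-scoring candidate is taken as the prediction, and a softmax over the candidates gives the share of probability assigned to the correct one. The function names and the example scores are hypothetical, and the framework's own Accuracy Loglikelihood and Probability Mass metrics may be defined differently.

```python
import math
from typing import Dict


def pick_choice(loglikelihoods: Dict[str, float]) -> str:
    """Return the candidate answer with the highest log-likelihood."""
    return max(loglikelihoods, key=loglikelihoods.get)


def probability_mass(loglikelihoods: Dict[str, float], correct: str) -> float:
    """Share of probability on `correct` after normalizing over all candidates."""
    top = max(loglikelihoods.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - top) for k, v in loglikelihoods.items()}
    return weights[correct] / sum(weights.values())


if __name__ == "__main__":
    # Made-up log-likelihoods for three answer options of one question.
    scores = {"A": -4.2, "B": -1.3, "C": -3.7}
    prediction = pick_choice(scores)
    print("prediction:", prediction)
    print("correct:", prediction == "B")
    print("probability mass on B:", round(probability_mass(scores, "B"), 3))
```

Because only the relative scores of the candidates matter, this style of evaluation needs no text generation and no answer parsing, which is why it suits classification-style benchmarks.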

Long-Context

Tasks designed for long-context scenarios, including QA, summarization, and aggregation. Examples include:

  • InfiniteBench_CodeDebug: Programming
  • ZeroSCROLLS GovReport: QA (Government)

Metrics

Evaluation metrics include:

  • Completion Metrics: Accuracy, Bleu, F1, Rouge
  • Loglikelihood Metrics: Accuracy Loglikelihood, Probability Mass
  • LLM Metrics: Chatbot Style Judge, Instruction Judge
  • Efficiency Metrics: Bytes per Sequence Position

For the full list of tasks and metrics, see Detailed Task Table.
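
Sample-level metric scores are usually aggregated into a mean, and the Statistical Analysis feature listed earlier mentions confidence intervals. As a hedged illustration of that idea, the sketch below computes accuracy with a percentile bootstrap confidence interval. The helper `bootstrap_ci` is hypothetical; the framework may use a different estimator.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_ci(
    scores: Sequence[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap over samples."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return statistics.fmean(scores), lower, upper


if __name__ == "__main__":
    # 1.0 = correct, 0.0 = incorrect, for ten evaluated samples (made-up data).
    per_sample = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
    mean, lower, upper = bootstrap_ci(per_sample)
    print(f"accuracy = {mean:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

With only a handful of samples the interval is wide, which is a useful reminder that small `--num-samples` runs are suited to smoke tests rather than model comparisons.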

Getting Started

Understanding the Evaluation Framework

Eval-Framework provides a unified interface for evaluating language models across diverse benchmarks. The framework follows this interaction model:

  1. Define Your Model - Specify which model to evaluate (HuggingFace, API, or custom)
  2. Choose Your Task - Select from 90+ available benchmarks or create custom ones
  3. Configure Evaluation - Set parameters like few-shot examples, sample count, and output format
  4. Run Evaluation - Execute locally via CLI/script or distribute via Determined AI
  5. Analyze Results - Review detailed JSON outputs, metrics, and generated reports

Core Components

  • Models: Defined via BaseLLM interface (HuggingFace, OpenAI, custom APIs)
  • Tasks: Inherit from BaseTask (completion, loglikelihood, or LLM-judge based)
  • Metrics: Automatic scoring via BaseMetric classes
  • Formatters: Handle prompt construction and model-specific formatting
  • Results: Structured outputs with sample-level details and aggregated statistics
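
The way these components fit together can be sketched with a toy, self-contained pipeline. Everything below is hypothetical and deliberately does not use the framework's classes: `Sample`, `LookupModel`, `format_prompt`, `exact_match`, and `run_evaluation` only mirror the roles of task samples, model, formatter, metric, and results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """One task item: a question and its reference answer."""
    question: str
    answer: str


class LookupModel:
    """Stand-in for a model: returns canned completions keyed by prompt."""

    def __init__(self, canned: Dict[str, str]) -> None:
        self.canned = canned

    def generate(self, prompt: str) -> str:
        return self.canned.get(prompt, "")


def format_prompt(sample: Sample, fewshot: List[Sample]) -> str:
    """Formatter role: prepend few-shot examples, then the open question."""
    shots = "".join(f"Q: {s.question}\nA: {s.answer}\n\n" for s in fewshot)
    return f"{shots}Q: {sample.question}\nA:"


def exact_match(completion: str, sample: Sample) -> float:
    """Metric role: 1.0 if the stripped completion equals the reference."""
    return float(completion.strip() == sample.answer)


def run_evaluation(model: LookupModel, samples: List[Sample], fewshot: List[Sample]) -> dict:
    """Results role: sample-level details plus an aggregated score."""
    rows = []
    for sample in samples:
        prompt = format_prompt(sample, fewshot)
        completion = model.generate(prompt)
        rows.append({"prompt": prompt, "completion": completion,
                     "score": exact_match(completion, sample)})
    accuracy = sum(r["score"] for r in rows) / len(rows) if rows else 0.0
    return {"samples": rows, "accuracy": accuracy}


if __name__ == "__main__":
    fewshot = [Sample("1 + 1?", "2")]
    samples = [Sample("2 + 2?", "4"), Sample("3 + 3?", "6")]
    model = LookupModel({format_prompt(samples[0], fewshot): " 4"})
    print(run_evaluation(model, samples, fewshot)["accuracy"])
```

Keeping the formatter and the metric separate from the model is what lets the same task run unchanged against different model backends.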

Your First Evaluation

  1. Install the framework (see Quick Start above):

     pip install eval_framework[transformers]

  2. Create and run your first evaluation using a HuggingFace model:

     from functools import partial
     from pathlib import Path
    
     from eval_framework.llm.huggingface import HFLLM
     from eval_framework.main import main
     from eval_framework.tasks.eval_config import EvalConfig
     from template_formatting.formatter import HFFormatter
    
     # Define your model
     class MyHuggingFaceModel(HFLLM):
         LLM_NAME = "microsoft/DialoGPT-medium"
         # partial binds the model name so the formatter can be built later without arguments
         DEFAULT_FORMATTER = partial(HFFormatter, "microsoft/DialoGPT-medium")
    
     if __name__ == "__main__":
         # Initialize your model
         llm = MyHuggingFaceModel()
    
         # Run the evaluation on the GSM8K task using 5 few-shot examples and 10 samples
         config = EvalConfig(
             output_dir=Path("./eval_results"),
             num_fewshot=5,
             num_samples=10,
             task_name="GSM8K",
             llm_class=MyHuggingFaceModel,
         )
    
         # Run evaluation and get results
         results = main(llm=llm, config=config)

  3. Review results - Check ./eval_results/ for detailed outputs and use our results guide to interpret them

Next Steps

Documentation

Getting Started

Advanced Usage

Scaling & Production

Contributing

Citation

If you use eval-framework in your research, please cite:

@software{eval_framework,
  title={Aleph Alpha Eval Framework},
  year={2025},
  url={https://github.com/Aleph-Alpha-Research/eval-framework}
}

License

This project is licensed under the Apache License 2.0.



This project has received funding from the European Union’s Digital Europe Programme under grant agreement No. 101195233 (OpenEuroLLM).

The contents of this publication are the sole responsibility of the OpenEuroLLM consortium and do not necessarily reflect the opinion of the European Union.

