CALAMITA-AILC/calamita-eval

CALAMITA Evaluation


News: the CALAMITA leaderboard is public!
We are still working on the official CALAMITA website, stay tuned!
The paper is available on arXiv.



This repository contains scripts and utilities to run and reproduce the CALAMITA results.

Getting started

Create an isolated Python environment and install the lm-eval-harness submodule. We use a fork of the official lm-eval-harness that offers additional functionality, including loading datasets locally, freeing VRAM after generation to make room for evaluation code that itself uses LLMs, and specifying custom aggregation functions. Then install the requirements listed in requirements.txt.

Assuming your environment is a virtualenv/uv environment called venv, the commands after the env creation should resemble:

source venv/bin/activate
pip install -e './lm-eval-harness[vllm]'
pip install -r requirements.txt

(optional) Prepare Files Locally

If your computing environment does not have internet access, use the two scripts bash/download_datasets.sh and bash/download_models.sh to download task data and models locally.

Important: at the time of writing, a small number of the CALAMITA (sub)tasks are not formatted as Hugging Face datasets and require local loading. See the task-specific instructions below to run them.

Running a model on a task

  1. Select a list of subtasks and put them in a text file, one per line, e.g., tasks.txt (an example file is provided in the root directory).
  2. Schedule the job through SLURM, e.g.,
sbatch ./bash/run_model.sh meta-llama/Llama-3.1-70B-Instruct tasks.txt

If you do not use SLURM, the file ./bash/run_model.sh requires only minimal adaptation. Please open an issue if you need help troubleshooting.
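If you adapt the script, the run boils down to a single lm-eval-harness invocation. A minimal sketch, assuming the standard lm-eval CLI flags (`--model`, `--model_args`, `--tasks`) rather than the script's actual contents; the `echo` only prints the command so you can inspect it before executing:

```shell
# sketch of the command run_model.sh ultimately issues (names are examples,
# not the script's actual contents)
printf '%s\n' abricot eureka_rebus > tasks.txt   # example task list
MODEL="meta-llama/Llama-3.1-70B-Instruct"
TASKS=$(paste -sd, tasks.txt)                    # join the task names with commas
echo lm_eval --model vllm --model_args "pretrained=${MODEL}" --tasks "${TASKS}"
```

Drop the `echo` to actually launch the evaluation once the environment from the previous section is active.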

Task-specific instructions

All task configurations are stored in separate folders under ./tasks. Each task can be composed of multiple subtasks, listed under a group key. For example, Eureka Rebus (./tasks/eureka_rebus/default.yaml) is composed of the eureka_original and eureka_hints subtasks. You can specify a single subtask or a group name; the latter schedules all subtasks belonging to the group.
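As a point of reference, a group configuration in lm-eval-harness typically looks like the sketch below; the actual file in this repository may use different keys or additional settings:

```yaml
# hypothetical sketch in the usual lm-eval-harness group style;
# check the repository's own YAML for the authoritative version
group: eureka_rebus
task:
  - eureka_original
  - eureka_hints
```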

Most tasks will be run as standard lm-eval tasks, using their names as keys as defined in the configuration file. For example, to run Eureka Rebus and Abricot, including all their subtasks, create a text file as:

abricot
eureka_rebus

and schedule a job with the command shown in the previous section.
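Before scheduling, a quick sanity check is to verify that each name in the file matches a configuration folder under ./tasks. This assumes one folder per task or group, as with eureka_rebus above; subtasks defined inside a group file will not have a folder of their own:

```shell
# check that every name in tasks.txt has a configuration folder under ./tasks
# (assumes one folder per task/group; subtask names inside a group file
# will not have their own folder)
printf '%s\n' abricot eureka_rebus > tasks.txt
while read -r task; do
    if [ -d "tasks/${task}" ]; then
        echo "ok: ${task}"
    else
        echo "missing: ${task}"
    fi
done < tasks.txt
```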

About

Code for the CALAMITA evaluation framework for Italian datasets.
