News: The CALAMITA leaderboard is public!
We are still working on the official CALAMITA website, stay tuned!
The paper is available on arXiv.
This repository contains scripts and utilities to run and reproduce the CALAMITA results.
Create an isolated Python environment and install the submodule lm-eval-harness. We use a fork of the official lm-eval-harness offering additional functionalities, including local dataset loading, cleaning up VRAM after generation to make room for eval code that uses LLMs, and specifying custom aggregation functions. Then, install the requirements listed under requirements.txt.
Assuming your environment is a virtualenv/uv environment called venv, the commands after the env creation should resemble:
source venv/bin/activate
pip install -e './lm-eval-harness[vllm]'
pip install -r requirements.txt

If your computing environment does not have internet access, use the two scripts bash/download_datasets.sh and bash/download_models.sh to download task data and models locally.
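As a rough sketch of the offline workflow: once data and models are cached locally, the Hugging Face libraries can be told not to contact the network. The environment variables HF_HUB_OFFLINE and HF_DATASETS_OFFLINE belong to huggingface_hub and datasets. Whether the CALAMITA scripts already set them is an assumption this README does not confirm, and the download scripts may take arguments not shown here, so those calls are left commented out.

```shell
# Step 1 (on a machine with internet access): fetch task data and models.
# The two scripts are the ones named in this README; any arguments they
# expect are not documented here, so the calls are left commented out.
# bash bash/download_datasets.sh
# bash bash/download_models.sh

# Step 2 (on the offline node): ask the Hugging Face libraries to use
# only the local cache. Assumption: the run scripts do not override these.
export HF_HUB_OFFLINE=1
export HF_DATASETS_OFFLINE=1

echo "HF_HUB_OFFLINE=${HF_HUB_OFFLINE} HF_DATASETS_OFFLINE=${HF_DATASETS_OFFLINE}"
```

Setting the variables in the same shell (or job script) that launches the evaluation keeps them visible to the Python process.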
Important
Note: at the time of writing, a small number of the CALAMITA (sub)tasks are not formatted as Hugging Face datasets and require local loading. Please see the task-specific instructions below to run them.
- Select a list of subtasks and put them into a txt file, one per line, e.g., in a tasks.txt (you can find an example file in the root directory).
- Schedule the job through SLURM, e.g.,

sbatch ./bash/run_model.sh meta-llama/Llama-3.1-70B-Instruct tasks.txt

If you do not use SLURM, the file ./bash/run_model.sh requires minimal adaptation. Please open an issue if you need troubleshooting.
All task configurations are stored under ./tasks, each task in its own folder. A task can be composed of multiple subtasks, listed under a group key. For example, Eureka Rebus (./tasks/eureka_rebus/defaul.yaml) is composed of the eureka_original and eureka_hints subtasks.
You can specify a single subtask or a group name, which will schedule all subtasks belonging to it.
Most tasks will be run as standard lm-eval tasks, using their names as keys as defined in the configuration file. For example, to run Eureka Rebus and Abricot, including all their subtasks, create a text file as:
abricot
eureka_rebus

and schedule a job with the sbatch command shown above.
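The two steps can be sketched end to end as follows. The file name my_tasks.txt and the SUBMIT switch are hypothetical, introduced here only to avoid overwriting the example tasks.txt and to keep the snippet a dry run by default. The model name and the sbatch invocation are the ones used in this README.

```shell
# Write one task or group name per line (hypothetical file name,
# chosen so the example tasks.txt in the root directory is untouched).
TASKS_FILE="my_tasks.txt"
printf '%s\n' abricot eureka_rebus > "$TASKS_FILE"

# Sanity check: count the non-blank lines, i.e. the scheduled entries.
n_tasks=$(grep -c -v '^[[:space:]]*$' "$TASKS_FILE")
echo "${TASKS_FILE} lists ${n_tasks} entries"

# Submit only when explicitly requested; SUBMIT is a hypothetical
# switch for this sketch, not an option of run_model.sh.
MODEL="meta-llama/Llama-3.1-70B-Instruct"
if [ "${SUBMIT:-0}" = "1" ]; then
    sbatch ./bash/run_model.sh "$MODEL" "$TASKS_FILE"
else
    echo "dry run: sbatch ./bash/run_model.sh $MODEL $TASKS_FILE"
fi
```

Running the snippet with SUBMIT=1 set in the environment submits the job; without it, only the command that would be submitted is printed.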


