CALAMITA-AILC/calamita-eval

CALAMITA Evaluation


News: the CALAMITA leaderboard is public!
We are still working on the official CALAMITA website, stay tuned!
The paper is available on arXiv.



This repository contains scripts and utilities to run and reproduce the CALAMITA results.

Getting started

Create an isolated Python environment and install the lm-eval-harness submodule. We use a fork of the official lm-eval-harness that offers additional functionality, including loading datasets locally, freeing VRAM after generation to make room for evaluation code that itself uses LLMs, and specifying custom aggregation functions. Then install the requirements listed in requirements.txt.

Assuming your environment is a virtualenv/uv environment called venv, the commands after the env creation should resemble:

source venv/bin/activate
pip install -e './lm-eval-harness[vllm]'
pip install -r requirements.txt

(optional) Prepare Files Locally

If your computing environment does not have internet access, use the two scripts bash/download_datasets.sh and bash/download_models.sh to download task data and models locally.

Important: at the time of writing, a small number of the CALAMITA (sub)tasks are not formatted as Hugging Face datasets and require local loading. See the task-specific instructions below to run them.

Running a model on a task

  1. Select a list of subtasks and put them in a text file, one per line, e.g., tasks.txt (an example file is provided in the root directory).
  2. Schedule the job through SLURM, e.g.,
sbatch ./bash/run_model.sh meta-llama/Llama-3.1-70B-Instruct tasks.txt

If you do not use SLURM, the file ./bash/run_model.sh requires only minimal adaptation. Please open an issue if you need help troubleshooting.
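If you adapt the script, the run boils down to a single lm-eval-harness invocation. A minimal sketch, assuming the standard lm-eval CLI flags (`--model`, `--model_args`, `--tasks`) rather than the script's actual contents; the `echo` only prints the command so you can inspect it before executing:

```shell
# sketch of the command run_model.sh ultimately issues (names are examples,
# not the script's actual contents)
printf '%s\n' abricot eureka_rebus > tasks.txt   # example task list
MODEL="meta-llama/Llama-3.1-70B-Instruct"
TASKS=$(paste -sd, tasks.txt)                    # join the task names with commas
echo lm_eval --model vllm --model_args "pretrained=${MODEL}" --tasks "${TASKS}"
```

Drop the `echo` to actually launch the evaluation once the environment from the previous section is active.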

Task-specific instructions

All task configurations are stored in separate folders under ./tasks. Each task can be composed of multiple subtasks, listed under a group key. For example, Eureka Rebus (./tasks/eureka_rebus/default.yaml) is composed of the eureka_original and eureka_hints subtasks. You can specify a single subtask or a group name; the latter schedules all subtasks belonging to the group.
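As a point of reference, a group configuration in lm-eval-harness typically looks like the sketch below; the actual file in this repository may use different keys or additional settings:

```yaml
# hypothetical sketch in the usual lm-eval-harness group style;
# check the repository's own YAML for the authoritative version
group: eureka_rebus
task:
  - eureka_original
  - eureka_hints
```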

Most tasks will be run as standard lm-eval tasks, using their names as keys as defined in the configuration file. For example, to run Eureka Rebus and Abricot, including all their subtasks, create a text file as:

abricot
eureka_rebus

and schedule a job with the command shown in the previous section.
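Before scheduling, a quick sanity check is to verify that each name in the file matches a configuration folder under ./tasks. This assumes one folder per task or group, as with eureka_rebus above; subtasks defined inside a group file will not have a folder of their own:

```shell
# check that every name in tasks.txt has a configuration folder under ./tasks
# (assumes one folder per task/group; subtask names inside a group file
# will not have their own folder)
printf '%s\n' abricot eureka_rebus > tasks.txt
while read -r task; do
    if [ -d "tasks/${task}" ]; then
        echo "ok: ${task}"
    else
        echo "missing: ${task}"
    fi
done < tasks.txt
```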

About

Code for the CALAMITA evaluation framework for Italian datasets.
